NSF PAR Search | NSF Public Access Repository

Evaluating Tuning Opportunities of the LLVM/OpenMP Runtime

https://doi.org/10.1109/SCW63240.2024.00131

Chheda, Smeet; Verma, Gaurav; Tian, Shilei; Chapman, Barbara; Doerfert, Johannes (November 2024, IEEE)

Full Text Available
ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

https://doi.org/10.1109/IPDPSW63119.2024.00070

TehraniJamsaz, Ali; Mishra, Alok; Dutta, Akash; Malik, Abid M; Chapman, Barbara; Jannesari, Ali (May 2024, IEEE)

Full Text Available
Cross-Feature Transfer Learning for Efficient Tensor Program Generation

https://doi.org/10.3390/app14020513

Verma, Gaurav; Raskar, Siddhisanket; Emani, Murali; Chapman, Barbara (January 2024, Applied Sciences)

Tuning tensor program generation involves navigating a vast search space to find optimal program transformations and measurements for a program on the target hardware. The complexity of this process is further amplified by the exponential combinations of transformations, especially in heterogeneous environments. This research addresses these challenges by introducing a novel approach that learns the joint neural network and hardware features space, facilitating knowledge transfer to new, unseen target hardware. A comprehensive analysis is conducted on the existing state-of-the-art dataset, TenSet, including a thorough examination of test split strategies and the proposal of methodologies for dataset pruning. Leveraging an attention-inspired technique, we tailor the tuning of tensor programs to embed both neural network and hardware-specific features. Notably, our approach substantially reduces the dataset size by up to 53% compared to the baseline without compromising Pairwise Comparison Accuracy (PCA). Furthermore, our proposed methodology demonstrates competitive or improved mean inference times with only 25–40% of the baseline tuning time across various networks and target hardware. The attention-based tuner can effectively utilize schedules learned from previous hardware program measurements to optimize tensor program tuning on previously unseen hardware, achieving a top-5 accuracy exceeding 90%. This research introduces a significant advancement in autotuning tensor program generation, addressing the complexities associated with heterogeneous environments and showcasing promising results regarding efficiency and accuracy.
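The abstract reports results in terms of Pairwise Comparison Accuracy without defining it. As a rough illustration only, the sketch below assumes one common reading of the metric: the fraction of program pairs whose predicted ordering agrees with the measured ordering. The function name, the tie handling, and the sample values are hypothetical and are not taken from the paper or from TenSet.

```python
from itertools import combinations


def pairwise_comparison_accuracy(predicted, measured):
    """Fraction of pairs whose predicted order matches the measured order.

    Assumed definition for illustration; the paper may define it differently.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    correct = 0
    total = 0
    for i, j in combinations(range(len(measured)), 2):
        if measured[i] == measured[j]:
            continue  # ties in the measurements carry no ordering information
        total += 1
        # Same sign of the two differences means the pair is ordered correctly.
        if (predicted[i] - predicted[j]) * (measured[i] - measured[j]) > 0:
            correct += 1
    return correct / total if total else 0.0


# Hypothetical scores: the model swaps the order of the last two programs.
score = pairwise_comparison_accuracy([1.0, 3.0, 2.0], [10.0, 20.0, 30.0])
```

Under this reading, a cost model only needs to rank candidate programs correctly, not predict their absolute run times.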
Full Text Available
OpenMP Kernel Language Extensions for Performance Portable GPU Codes

Tian, Shilei; Scogland, Tom; Chapman, Barbara; Doerfert, Johannes (November 2023, Association for Computing Machinery)
Badia, Rosa M; Mohror, Kathryn (Ed.)
In contemporary high-performance computing architectures, the integration of GPU accelerators has become increasingly prevalent. To harness the full potential of these accelerators, developers often resort to vendor-specific kernel languages, such as CUDA. While this approach ensures optimal efficiency, it inherently compromises portability and engenders vendor dependency. Existing portable programming models, such as OpenMP, while promising, demand extensive code rewriting due to their fundamental difference from kernel languages. In this work, we introduce extensions to LLVM OpenMP, transforming it into a versatile and performance portable kernel language for GPU programming. These extensions allow for the seamless porting of programs from kernel languages to high-performance OpenMP GPU programs with minimal modifications. To evaluate our extensions, we implemented a proof-of-concept prototype that contains a subset of the extensions we proposed. We ported six established CUDA proxy and benchmark applications and evaluated their performance on both AMD and NVIDIA platforms. By comparing with native versions (HIP and CUDA), our results show that OpenMP, augmented with our extensions, can not only match but also in some cases exceed the performance of kernel languages, thereby offering performance portability with minimal effort from application developers.
Full Text Available
Implementing OpenMP’s SIMD Directive in LLVM’s GPU Runtime

https://doi.org/10.1145/3605573.3605640

Wright, Eric; Doerfert, Johannes; Tian, Shilei; Chapman, Barbara; Chandrasekaran, Sunita (August 2023, ACM)

Full Text Available
Performance Study on CPU-based Machine Learning with PyTorch

https://doi.org/10.1145/3581576.3581615

Chheda, Smeet; Curtis, Anthony; Siegmann, Eva; Chapman, Barbara (February 2023, HPC Asia '23 Workshops: Proceedings of the HPC Asia 2023 Workshops)

Full Text Available
Transfer Learning Across Heterogeneous Features For Efficient Tensor Program Generation

https://doi.org/10.1145/3587278.3595644

Verma, Gaurav; Raskar, Siddhisanket; Xie, Zhen; Malik, Abid M; Emani, Murali; Chapman, Barbara (February 2023, ACM)

Full Text Available
COMPOFF: A Compiler Cost model using Machine Learning to predict the Cost of OpenMP Offloading

https://doi.org/10.1109/IPDPSW55747.2022.00074

Mishra, Alok; Chheda, Smeet; Soto, Carlos; Malik, Abid M.; Lin, Meifeng; Chapman, Barbara (May 2022, 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW))

The HPC industry is inexorably moving towards an era of extremely heterogeneous architectures, with more devices configured on any given HPC platform and potentially more kinds of devices, some of them highly specialized. Writing a separate code suitable for each target system for a given HPC application is not practical. The better solution is to use directive-based parallel programming models such as OpenMP. OpenMP provides a number of options for offloading a piece of code to devices like GPUs. To select the best of these options during compilation, most modern compilers use analytical models to estimate the cost of executing the original code and the different offloading code variants. Building such an analytical model for compilers is a difficult task that necessitates a lot of effort on the part of a compiler engineer. Recently, machine learning techniques have been successfully applied to build cost models for a variety of compiler optimization problems. In this paper, we present COMPOFF, a cost model which uses multi-layer perceptrons to statically estimate the Cost of OpenMP OFFloading. We used six different transformations on a parallel code of Wilson Dslash Operator to support GPU offloading, and we predicted their cost of execution on different GPUs using COMPOFF during compile time. Our results show that this model can predict offloading costs with a root mean squared error in prediction of less than 0.5 seconds. Our preliminary findings indicate that this work will make it much easier and faster for scientists and compiler developers to port legacy HPC applications that use OpenMP to new heterogeneous computing environments.
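The abstract quotes its accuracy as a root mean squared error below 0.5 seconds. For readers unfamiliar with the metric, here is a minimal sketch of how such a figure is computed from predicted and measured run times. The function name and the sample values are hypothetical; they are not data or code from the paper.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of run times."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical run times in seconds for four offloading variants.
predicted_seconds = [1.2, 0.8, 2.5, 3.1]
measured_seconds = [1.0, 1.0, 2.4, 3.5]
error_seconds = rmse(predicted_seconds, measured_seconds)
```

Because the errors are squared before averaging, a single badly mispredicted variant raises the RMSE more than several small misses do.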
Full Text Available
Comparing the behavior of OpenMP Implementations with various Applications on two different Fujitsu A64FX platforms

https://doi.org/10.1145/3437359.3465592

Michalowicz, Benjamin; Raut, Eric; Kang, Yan; Curtis, Tony; Chapman, Barbara; Oryspayev, Dossay (July 2021, PEARC '21: Practice and Experience in Advanced Research Computing)
The development of the A64FX processor by Fujitsu has been a massive innovation in vectorized processors and led to Fugaku, currently the world’s fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications with different compilers, and to examine how these applications scale on the different A64FX processors on clusters at Stony Brook University and RIKEN.
Full Text Available
A64FX performance: experience on Ookami

https://doi.org/10.1109/Cluster48925.2021.00106

Bari, Md Abdullah; Chapman, Barbara; Curtis, Anthony; Harrison, Robert J.; Siegmann, Eva; Simakov, Nikolay A.; Jones, Matthew D. (September 2021, 2021 IEEE International Conference on Cluster Computing (CLUSTER))

Full Text Available
